In this paper we introduce MCCE (Monte Carlo sampling of realistic Counterfactual Explanations), a model-based method that generates counterfactual explanations by producing a set of feasible examples using conditional inference trees. Unlike algorithm-based counterfactual methods that have to solve complex optimization problems, and unlike other model-based methods that simulate the data distribution with heavyweight machine learning models, MCCE consists of only two lightweight steps (generation and post-processing). MCCE is also straightforward for the end user to understand and implement, handles any type of predictive model and any type of feature, takes actionability constraints into account when generating the counterfactual explanations, and generates as many counterfactual explanations as needed. In this paper we introduce MCCE and give a comprehensive list of performance metrics that can be used to compare counterfactual explanations. We also compare MCCE with a range of state-of-the-art methods and a new baseline method on benchmark data sets. MCCE outperforms all model-based and algorithm-based methods when both validity (i.e., a correctly changed prediction) and actionability constraints are taken into account. Finally, we show that MCCE performs almost as well when given only a small subset of the training data.
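A minimal sketch of the two-step idea under simplifying assumptions (for brevity, the conditional-inference-tree generation step is replaced by resampling rows from the training data; `model` is any classifier with a scikit-learn-style `predict`):

```python
import numpy as np

def mcce_sketch(model, X_train, x, fixed_idx, n_samples=1000):
    """Toy sketch of MCCE's two steps: (1) generate candidates that keep
    the fixed (non-actionable) features of x, (2) post-process by keeping
    valid candidates and returning the one closest to x. The generation
    step here resamples from the empirical training distribution instead
    of fitting conditional inference trees."""
    rng = np.random.default_rng(0)
    # Step 1: generation -- candidates share x's fixed features, the
    # remaining features come from randomly drawn training rows.
    rows = rng.choice(len(X_train), size=n_samples)
    candidates = X_train[rows].copy()
    candidates[:, fixed_idx] = x[fixed_idx]
    # Step 2: post-processing -- keep only valid candidates, i.e. those
    # whose prediction differs from the prediction for x ...
    valid = candidates[model.predict(candidates) != model.predict(x[None, :])[0]]
    if len(valid) == 0:
        return None
    # ... and return the valid candidate closest to x (L1 distance).
    return valid[np.argmin(np.abs(valid - x).sum(axis=1))]
```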
There has recently been intense activity around the embedding of very high-dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. In the first part we cover nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph-based methods and diffusion maps, kernel-based methods and random projections. The second part is concerned with topological embedding methods, in particular the mapping of topological properties into persistence diagrams and the Mapper algorithm. Another type of data set experiencing tremendous growth is very high-dimensional network data. The task considered in the third part is how to embed such data in a vector space of moderate dimension so that the data become amenable to traditional techniques such as clustering and classification. This can arguably be seen as a contrast between algorithmic machine-learning methods and statistical modeling (the so-called stochastic block model); in the paper we discuss the pros and cons of the two approaches. The final part of the survey deals with embedding in $\mathbb{R}^2$, i.e., visualization. Three methods are presented, based on methods from the first, second and third parts, respectively: $t$-SNE, UMAP and LargeVis. They are illustrated and compared on two simulated data sets: one consisting of a triplet of noisy ranunculoid curves, and one of networks of increasing complexity generated with stochastic block models and with two types of nodes.
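As a loose illustration of the visualization part, a sketch that embeds a simulated 3D structure into $\mathbb{R}^2$ with $t$-SNE and UMAP (assuming scikit-learn and the umap-learn package are installed; LargeVis is omitted for lack of an equally standard Python API):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

# A nonlinear 3D structure to embed into R^2.
X, color = make_swiss_roll(n_samples=1500, noise=0.3, random_state=0)

X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
X_umap = umap.UMAP(n_components=2, n_neighbors=15, random_state=0).fit_transform(X)
print(X_tsne.shape, X_umap.shape)  # both (1500, 2), ready for a scatter plot
```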
The distributed representation of symbols is one of the key technologies in machine learning systems today, playing a pivotal role in modern natural language processing. Traditional word embeddings associate a separate vector with each word. While this approach is simple and leads to good performance, it requires a lot of memory for representing a large vocabulary. To reduce the memory footprint, the default embedding layer in spaCy is a hash embeddings layer. It is a stochastic approximation of traditional embeddings that provides unique vectors for a large number of words without explicitly storing a separate vector for each of them. To be able to compute meaningful representations for both known and unknown words, hash embeddings represent each word as a summary of the normalized word form, subword information and word shape. Together, these features produce a multi-embedding of a word. In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail. Second, we critically evaluate the hash embedding architecture with multi-embeddings on Named Entity Recognition datasets from a variety of domains and languages. The experiments validate most key design choices behind spaCy's embedders, but we also uncover a few surprising results.
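A toy sketch of the hashing trick and multi-embedding idea described above (an illustration of the general mechanism, not spaCy's actual implementation):

```python
import zlib
import numpy as np

class HashEmbedding:
    """Toy hash embedding: a fixed number of rows serves an unbounded
    vocabulary, so memory does not grow with vocabulary size."""
    def __init__(self, n_rows=5000, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(size=(n_rows, dim)).astype(np.float32)
        self.n_rows = n_rows

    def _row(self, feature):
        # Map an arbitrary string feature to a row index; distinct features
        # may collide, which is the price of the fixed memory footprint.
        return zlib.crc32(feature.encode()) % self.n_rows

    def __call__(self, word):
        # Multi-embedding: combine the normalized form, subword pieces
        # (prefix/suffix) and a coarse word shape, as described above.
        shape = "".join("X" if c.isupper() else "x" if c.islower()
                        else "d" if c.isdigit() else c for c in word)
        features = ["NORM=" + word.lower(), "PREFIX=" + word[:3],
                    "SUFFIX=" + word[-3:], "SHAPE=" + shape]
        return sum(self.table[self._row(f)] for f in features)

emb = HashEmbedding()
vec = emb("spaCy")  # unknown words get meaningful vectors too, e.g. emb("flurbl")
```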
Recent successes of massively overparameterized models have inspired a new line of work investigating the underlying conditions that enable overparameterized models to generalize well. This paper considers a framework where the possibly overparametrized model includes fake features, i.e., features that are present in the model but not in the data. We present a non-asymptotic high-probability bound on the generalization error of the ridge regression problem under the model misspecification of having fake features. Our high-probability results characterize the interplay between the implicit regularization provided by the fake features and the explicit regularization provided by the ridge parameter. We observe that fake features may improve the generalization error, even though they are irrelevant to the data.
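A small simulation sketch of the setting (a toy Gaussian design chosen here for illustration, not the paper's exact model): features with no counterpart in the data are appended to the model, and test error is compared with and without them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test, lam = 50, 20, 2000, 1e-2
w_true = rng.normal(size=d)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def test_mse(n_fake):
    # Training data: only the d real features carry signal.
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(size=n)
    # Fake features: present in the model, absent from the data.
    X_model = np.hstack([X, rng.normal(size=(n, n_fake))])
    w_hat = ridge(X_model, y, lam)
    Xt = rng.normal(size=(n_test, d))
    Xt_model = np.hstack([Xt, rng.normal(size=(n_test, n_fake))])
    return np.mean((Xt_model @ w_hat - Xt @ w_true) ** 2)

# Fake features act as extra implicit regularization on top of lam;
# depending on the regime they can lower or raise the test error.
for n_fake in (0, 50, 200, 1000):
    err = np.mean([test_mse(n_fake) for _ in range(20)])
    print(f"{n_fake:5d} fake features: test MSE {err:.3f}")
```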
We focus on the continual learning problem where the tasks arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the distributed estimation framework. We consider the well-established distributed learning algorithm CoCoA. We derive closed form expressions for the iterations for the overparametrized case. We illustrate the convergence and the error performance of the algorithm based on the over/under-parametrization of the problem. Our results show that depending on the problem dimensions and data generation assumptions, CoCoA can perform continual learning over a sequence of tasks, i.e., it can learn a new task without forgetting previously learned tasks, with access only to one task at a time.
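A centralized stand-in sketch of the underlying linear-algebra intuition (this is not the CoCoA iteration itself): in the overparametrized regime, fitting each newly arrived task with a minimum-norm correction disturbs previously learned tasks only mildly, and less so as the dimension grows relative to the total sample count.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_task, n_tasks = 1000, 10, 4   # heavily overparametrized
w_star = rng.normal(size=d)        # shared truth: tasks are jointly realizable
tasks = []
for _ in range(n_tasks):
    A = rng.normal(size=(n_task, d))
    tasks.append((A, A @ w_star))

w = np.zeros(d)
for A, b in tasks:
    # Fit the newly arrived task exactly with the minimum-norm correction.
    w = w + np.linalg.pinv(A) @ (b - A @ w)

# Relative residuals on all tasks after the full sequence; earlier tasks
# are perturbed by later updates, but only slightly when d >> total samples.
for i, (A, b) in enumerate(tasks):
    print(f"task {i}: {np.linalg.norm(A @ w - b) / np.linalg.norm(b):.3f}")
```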
Using 3D CNNs on high resolution medical volumes is very computationally demanding, especially for large datasets like the UK Biobank which aims to scan 100,000 subjects. Here we demonstrate that using 2D CNNs on a few 2D projections (representing mean and standard deviation across axial, sagittal and coronal slices) of the 3D volumes leads to reasonable test accuracy when predicting the age from brain volumes. Using our approach, one training epoch with 20,324 subjects takes 40 - 70 seconds using a single GPU, which is almost 100 times faster compared to a small 3D CNN. These results are important for researchers who do not have access to expensive GPU hardware for 3D CNNs.
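A minimal sketch of computing such projections (assuming the volume is a NumPy array; the axis order and sizes are illustrative):

```python
import numpy as np

def project_volume(vol):
    """Collapse a 3D volume into six 2D channels: the mean and standard
    deviation across each of the three anatomical axes."""
    channels = []
    for axis in (0, 1, 2):            # sagittal, coronal, axial (assumed order)
        channels.append(vol.mean(axis=axis))
        channels.append(vol.std(axis=axis))
    return channels

vol = np.random.rand(182, 218, 182)   # e.g. an MNI-sized brain volume
images = project_volume(vol)
print([im.shape for im in images])
# In practice the six projections are resized/padded to a common shape and
# stacked as channels, so a cheap 2D CNN can replace a 3D CNN on the volume.
```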
Large annotated datasets are required to train segmentation networks. In medical imaging, it is often difficult, time consuming and expensive to create such datasets, and it may also be difficult to share these datasets with other researchers. Different AI models can today generate very realistic synthetic images, which can potentially be openly shared as they do not belong to specific persons. However, recent work has shown that using synthetic images for training deep networks often leads to worse performance compared to using real images. Here we demonstrate that using synthetic images and annotations from an ensemble of 10 GANs, instead of from a single GAN, increases the Dice score on real test images by 4.7% to 14.0% on specific classes.
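A minimal sketch of the pooling idea (the GAN interface here is a hypothetical stand-in; real code would load ten independently trained generators):

```python
import numpy as np

class DummyGAN:
    """Stand-in for a trained GAN with a hypothetical .sample() method."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def sample(self, n):
        imgs = self.rng.random((n, 256, 256))     # synthetic images
        masks = (imgs > 0.5).astype(np.uint8)     # synthetic annotations
        return imgs, masks

def build_synthetic_training_set(generators, n_total):
    # Draw an equal share from each GAN; the pooled set covers more modes
    # of the real distribution than a sample from any single GAN, which is
    # the rationale for the ensemble.
    per_gan = n_total // len(generators)
    images, masks = [], []
    for gan in generators:
        imgs, anns = gan.sample(per_gan)
        images.append(imgs)
        masks.append(anns)
    return np.concatenate(images), np.concatenate(masks)

X, Y = build_synthetic_training_set([DummyGAN(s) for s in range(10)], 5000)
```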
Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$. We present an improved simulation of the WL test on GNNs with exponentially lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in $n$, and the feature vectors exchanged by the nodes of the GNN consist of only $O(\log n)$ bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near-)optimality of our construction.
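For reference, a short sketch of the 1-WL (color refinement) test that these GNN constructions simulate; hashing a node's color together with the multiset of its neighbours' colors is exactly the 'combine' step discussed above:

```python
def wl_colors(adj, rounds):
    """One-dimensional Weisfeiler-Lehman color refinement.
    adj: adjacency list {node: iterable of neighbours}."""
    colors = {v: 0 for v in adj}               # uniform initial coloring
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbour colors),
        # compressed back into small integer labels.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return colors

def wl_test(adj1, adj2, rounds=3):
    """False means certainly non-isomorphic; True means 'possibly isomorphic'."""
    return (sorted(wl_colors(adj1, rounds).values())
            == sorted(wl_colors(adj2, rounds).values()))

# Example: a 6-cycle vs. two disjoint triangles. Both are 2-regular, so
# 1-WL never refines the coloring and cannot tell them apart -- a classic
# limitation that message-passing GNNs inherit.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_test(cycle6, triangles))   # True
```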
Estimation of the 6D pose of rigid objects is a fundamental problem in computer vision. Traditionally, pose estimation is concerned with determining a single best estimate. However, a single estimate is unable to express visual ambiguity, which in many situations is unavoidable due to object symmetries or occlusion of identifying features. Inability to account for pose ambiguity can lead to failure of subsequent methods, which is unacceptable when the cost of failure is high. Estimates of full pose distributions are, in contrast to single estimates, well suited for expressing pose uncertainty. Motivated by this, we propose a novel pose distribution estimation method. An implicit formulation of the probability distribution over object pose is derived from an intermediate representation of the object as a set of keypoints. This ensures that the pose distribution estimates have a high level of interpretability. Furthermore, our method is based on conservative approximations, which leads to reliable estimates. The method has been evaluated on the task of rotation distribution estimation on the YCB-V and T-LESS datasets and performs reliably on all objects.
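A loose sketch of the general idea of turning keypoint agreement into a distribution over rotations (illustrative only; the rotation grid, Gaussian error model and interfaces are assumptions, not the proposed method):

```python
import numpy as np

def rotation_grid(n=2000, seed=0):
    """Crude uniform-ish sample of SO(3) via normalized random quaternions."""
    q = np.random.default_rng(seed).normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    w, x, y, z = q.T
    return np.stack([
        np.stack([1-2*(y**2+z**2), 2*(x*y-w*z),     2*(x*z+w*y)],     -1),
        np.stack([2*(x*y+w*z),     1-2*(x**2+z**2), 2*(y*z-w*x)],     -1),
        np.stack([2*(x*z-w*y),     2*(y*z+w*x),     1-2*(x**2+y**2)], -1),
    ], axis=1)                                   # shape (n, 3, 3)

def pose_distribution(model_keypoints, observed_keypoints, sigma=0.05):
    """Score each candidate rotation by keypoint agreement (assumed
    Gaussian error) and normalize into a discrete distribution. A symmetric
    object yields a multimodal distribution, which is precisely what a
    single pose estimate cannot express."""
    R = rotation_grid()
    pred = R @ model_keypoints.T                 # (n, 3, k) rotated keypoints
    err = ((pred - observed_keypoints.T) ** 2).sum(axis=(1, 2))
    logp = -err / (2 * sigma ** 2)
    p = np.exp(logp - logp.max())
    return R, p / p.sum()
```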
Deep learning training is an expensive process that makes extensive use of GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads that do not require all the memory and compute resources of a full GPU. In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads of three sizes, focusing on image recognition training with ResNet models. We study the behavior of these workloads when running in isolation on the various MIG instances allowed by the GPU, in addition to running them in parallel on co-located homogeneous instances on the same GPU. Our results demonstrate that employing MIG can significantly improve the utilization of the GPU when the workload is too small to utilize the whole GPU in isolation. By training multiple small models in parallel, more work can be performed by the GPU per unit of time, despite the increase in time per epoch, leading to $\sim$3x the throughput. In contrast, for medium- and large-sized workloads, which already utilize the whole GPU well, MIG only provides marginal performance improvements. Nevertheless, we observe that training models in parallel on separate MIG partitions does not exhibit interference, underlining the value of having a functionality like MIG on modern GPUs.
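A minimal sketch of launching independent training jobs on separate MIG instances from Python (the MIG UUIDs, normally obtained from `nvidia-smi -L`, and the training script name are placeholders):

```python
import os
import subprocess

# MIG device UUIDs as reported by `nvidia-smi -L` (placeholders here).
mig_uuids = [
    "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx0",
    "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx1",
]

procs = []
for uuid in mig_uuids:
    # Each training process sees exactly one MIG instance as its only GPU,
    # so the co-located jobs run in parallel without interfering.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": uuid}
    procs.append(subprocess.Popen(
        ["python", "train_resnet.py"],   # placeholder training script
        env=env))

for p in procs:
    p.wait()
```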